from IPython.core.display import HTML
# https://stackoverflow.com/questions/32156248/how-do-i-set-custom-css-for-my-ipython-ihaskell-jupyter-notebook
with open('./custom_css.css', 'r') as f:
    styles = f.read()
HTML('<style>%s</style>' % styles)



Data Analysis and Visualization
NYC MTA Turnstile Usage Dataset (link - web.mta.info/developers/turnstile.html)
① The data source is the Metropolitan Transportation Authority website [link]
② The dataset we use is the NYC MTA Turnstile Usage Dataset
Have a look at / download the following dataset:
- Go to http://web.mta.info/developers/turnstile.html
- This dataset shows entry & exit counter values for each turnstile-device in each station in the NYC Subway System.
- Note these aren’t counts per interval, but equivalent to an “odometer” reading for each device.
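Because each reading behaves like an odometer, per-interval usage can be recovered by differencing consecutive readings per device. A minimal pandas sketch (the toy values are taken from the raw data example shown further below):

```python
import pandas as pd

# Cumulative ENTRIES readings for one turnstile device
readings = pd.Series([2670738, 2670738, 2670746, 2670790])

# Differencing consecutive readings gives the count per audit interval
per_interval = readings.diff().fillna(0).astype(int)
print(per_interval.tolist())  # [0, 0, 8, 44]
```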
🅐 The dataset stores rough (hardware audit level) time aggregates of gate usage in the New York City subway system
③ Brief data description
The domain is transportation (mass public transit): the subway system (known in different parts of the world as Underground, Metro, U-Bahn, Rapid transit, or Heavy rail) of the New York metropolitan area, operated by the Metropolitan Transportation Authority of NYC.
① A raw data example is below:
A002,R051,02-00-00,03-21-10,00:00:00,REGULAR,002670738,000917107,03-21-10,04:00:00,REGULAR,002670738,000917107,03-21-10,08:00:00,REGULAR,002670746,000917117,03-21-10,12:00:00,REGULAR,002670790,000917166,03-21-10,16:00:00,REGULAR,002670932,000917204,03-21-10,20:00:00,REGULAR,002671164,000917230,03-22-10,00:00:00,REGULAR,002671181,000917231,03-22-10,04:00:00,REGULAR,002671181,000917231
A002,R051,02-00-00,03-22-10,08:00:00,REGULAR,002671220,000917324,03-22-10,12:00:00,REGULAR,002671364,000917640,03-22-10,16:00:00,REGULAR,002671651,000917719,03-22-10,20:00:00,REGULAR,002672430,000917789,03-23-10,00:00:00,REGULAR,002672473,000917795,03-23-10,04:00:00,REGULAR,002672474,000917795,03-23-10,08:00:00,REGULAR,002672516,000917876,03-23-10,12:00:00,REGULAR,002672652,000917934
② Vendor's description of columns is here (ts_Field_Description_pre-10-18-2014.txt):
Field Description
C/A,UNIT,SCP,DATE1,TIME1,DESC1,ENTRIES1,EXITS1,DATE2,TIME2,DESC2,ENTRIES2,EXITS2,DATE3,TIME3,DESC3,ENTRIES3,EXITS3,DATE4,TIME4,DESC4,ENTRIES4,EXITS4,DATE5,TIME5,DESC5,ENTRIES5,EXITS5,DATE6,TIME6,DESC6,ENTRIES6,EXITS6,DATE7,TIME7,DESC7,ENTRIES7,EXITS7,DATE8,TIME8,DESC8,ENTRIES8,EXITS8
- C/A = Control Area (A002)
- UNIT = Remote Unit for a station (R051)
- SCP = Subunit Channel Position, a specific address for a device (02-00-00)
- DATEn = the date (MM-DD-YY)
- TIMEn = the time (hh:mm:ss) of a scheduled audit event
- DESCn = the "REGULAR" scheduled audit event (occurs every 4 hours)
- ENTRIESn = the cumulative entry register value for a device
- EXITSn = the cumulative exit register value for a device
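Each pre-10/18/2014 row carries 3 key fields plus up to 8 audit quintets, which can be unpivoted into one row per audit event. A minimal sketch (the helper name `wide_to_long` is illustrative, not from the project code; the sample line is shortened to two quintets):

```python
import pandas as pd

RAW = (
    "A002,R051,02-00-00,"
    "03-21-10,00:00:00,REGULAR,002670738,000917107,"
    "03-21-10,04:00:00,REGULAR,002670738,000917107"
)

def wide_to_long(line: str) -> pd.DataFrame:
    """Unpivot one raw row: 3 key fields + N groups of 5 audit fields."""
    fields = line.strip().split(',')
    ca, unit, scp = fields[:3]
    records = []
    for i in range(3, len(fields), 5):
        date, time, desc, entries, exits = fields[i:i + 5]
        records.append({'C/A': ca, 'UNIT': unit, 'SCP': scp,
                        'DATE': date, 'TIME': time, 'DESC': desc,
                        'ENTRIES': int(entries), 'EXITS': int(exits)})
    return pd.DataFrame(records)

long_df = wide_to_long(RAW)
print(long_df.shape)  # (2, 8)
```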
③ "Standardized"* dataset example:
| UNIT | C/A | SCP | DATE | TIME | DESC | ENTRIES | EXITS |
|---|---|---|---|---|---|---|---|
| R051 | A002 | 02-00-00 | 2013-01-02 | 07:00:00 | REGULAR | 3932819 | 1355901 |
Current observations (after 10/18/14) already come in this format.
㊉ Additional dataset example (Remote Unit/Control Area/Station Name Key):
| Remote | Booth | Station | Line Name | Division |
|---|---|---|---|---|
| R001 | A060 | WHITEHALL ST | R1 | BMT |
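The key above can be joined onto turnstile records by the (C/A, UNIT) pair. A minimal sketch with toy one-row frames (column names follow the tables above):

```python
import pandas as pd

# Toy turnstile record and one row of the Remote-Booth-Station key
turns = pd.DataFrame({'C/A': ['A060'], 'UNIT': ['R001'], 'ENTRIES': [123]})
key = pd.DataFrame({'Booth': ['A060'], 'Remote': ['R001'],
                    'Station': ['WHITEHALL ST']})

# Left join keeps turnstile rows even when the key has no match
merged = turns.merge(key, left_on=['C/A', 'UNIT'],
                     right_on=['Booth', 'Remote'], how='left')
print(merged.Station.iloc[0])  # WHITEHALL ST
```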
The list of questions is below.
🔬 Data analysis:
- Which station has the most number of units?
- What is the total number of entries & exits across the subway system for February 1, 2013?
- Let’s define the busy-ness as sum of entry & exit count. What station was the busiest on February 1, 2013? What turnstile was the busiest on that date?
- What stations have seen the most usage growth/decline in 2013?
- What dates are the least busy? Could you identify days on which stations were not operating at full capacity or closed entirely?
- Bonus: What hour is the busiest for station CANAL ST in Q1 2013?
📊 Visualization:
- Plot the daily row counts for data files in Q1 2013.
- Plot the daily total number of entries & exits across the system for Q1 2013.
- Plot the mean and standard deviation of the daily total number of entries & exits for each month in Q1 2013 for station 34 ST-PENN STA.
- Plot 25/50/75 percentile of the daily total number of entries & exits for each month in Q1 2013 for station 34 ST-PENN STA.
- Plot the daily number of closed stations and number of stations that were not operating at full capacity in Q1 2013.
Answers are in [another chapter]
🗺️ [Maps omitted] The maps are for illustrative and informational purposes only
This task is already done.
We identified five (5) main streams, "industries", areas where the dataset might be utilized, for example:
📡 "Telecommunication networks" (optimization of the network and its service quality and availability in stations and in metro cars)
💺 "Transit allocation" (optimize rolling stock volumes, between-train intervals, and staff scheduling over the day; optimize volumes of buses, trams, and taxis around a block)
👷 The scheme describes the core steps: ingestion (extraction), preprocessing (cleaning, feature generation), and analytical reporting (notes, charts)
Issues we found in the data:
- Some C/A (Booth) - UNIT (Remote) pairs cannot be mapped using the provided file: e.g., A049 - R088 or R612 - R057 do not exist in the mapping book (NA).
- ENTRIES and EXITS have negative values and values outside the [25%, 95%] percentile range for the whole dataset.
- The REGULAR desc code sometimes occurs outside of the 4-hour frequency rule.
Usually, data cleaning is the most expert-oriented stage of work.
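A minimal sketch of flagging the impossible negative counter movements (toy values; which thresholds to use for cleaning is a judgment call):

```python
import pandas as pd

# Toy cumulative ENTRIES per turnstile; 120 after 150 means a reset/rollback
df = pd.DataFrame({
    'TURNSTILE_ID': ['A002 R051 02-00-00'] * 4,
    'ENTRIES': [100, 150, 120, 155],
})

# Diff within each device; a true odometer can never decrease
diffs = df.groupby('TURNSTILE_ID').ENTRIES.diff()
negative = diffs < 0
print(diffs.tolist())       # [nan, 50.0, -30.0, 35.0]
print(int(negative.sum()))  # 1
```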
Questions we ask:
🧹 Cleaning:
Traditionally, feature engineering has no limits but imagination, especially when interpretation does not matter.
✔️ However, the current task is well defined, and interpretation does matter here.
So, we will:
- Merge station-related information
  - Remote Unit/Control Area/Station Name Key file: map with C/A - UNIT pairs
  - previously geocoded open-sourced file: map with station names
- Extend DESC codes and TURNSTILE features
  - REGULAR vs. all-except-REGULAR DESC codes
  - C/A with UNIT with SCP as the TURNSTILE identification
- Extract basic date & time features from DATE-TIME fields
  - DATE field, TIME field
- Process odometer-like readings of ENTRIES and EXITS
  - ENTRIES and EXITS volumes
① Let's import modules we need for analysis
import sys; sys.path.insert(0, '..')
img_num = 1 # constant to enumerate figures
main_title_dict_style = dict(backgroundcolor='#009aa6', color='white', size=12, weight='bold',
horizontalalignment='left')
import pandas as pd
import numpy as np
import warnings; warnings.simplefilter(action='ignore', category=FutureWarning)
import matplotlib.pyplot as plt; plt.rcParams['figure.facecolor'] = 'white'
import plotly.express as px
#!pip install PtitPrince
import ptitprince as pt
import seaborn as sns
import src.get_data as get_data
import src.feature_generation as feature_generation
import src.visualisations as visuals_prepared
import importlib
importlib.reload(get_data); importlib.reload(feature_generation); importlib.reload(visuals_prepared);
② Let's run the initial processing pipeline
✔️ Local storage is intuitive & its structure follows the DS Cookiecutter standard [link]:
├── data
│   ├── external       <- Data from third party sources.
│   ├── interim        <- Intermediate data that has been transformed.
│   ├── processed      <- The final, canonical data sets for modeling.
│   └── raw            <- The original, immutable data dump.
%%time
RUN_FROM_RAW_DATA = False
RUN_FROM_FEATURE_ENG = False
if RUN_FROM_RAW_DATA:
    # Access and load data, then bring it into one analytical format
    links = get_data._get_links_to_raw_data()
    raw_files = get_data.download_raw_data(links, download=False)
    assert len(links) == len(raw_files)
    long_files = get_data.reorganize_raw_files(raw_files, custom=False)
    assert len(raw_files) == len(long_files)
    _ = get_data._concat_files(long_files, save_path='../data/interim/turnstile_concated_long.csv')
if RUN_FROM_FEATURE_ENG:
    # Preprocess data
    df = pd.read_csv('../data/interim/turnstile_concated_long.csv', index_col=0)
    # By the data update format, raw files covering Q1 2013 also include the last dates
    # of Q4 2012 and the first dates of April - clean them out
    df = df[~(df.DATE.str.contains('-12$') | (df.DATE.str.contains('^04-')))]
    # Feature generation
    # 1. Decode known C/A (Booth) - UNIT (Remote) pairs to add stations; also add coordinates
    # add stations using C/A and UNIT via the mapping table
    df = feature_generation.add_stations(df=df,
                                         path_to_stations_data='../data/external/Remote-Booth-Station.csv')
    df.to_csv('../data/interim/data_with_stations.csv')
    # add geocoded coordinates to stations using the mapping table
    df = feature_generation.add_coordinates(df=df,
                                            path_to_stations_coords='../data/external/geocoded.csv')
    df.to_csv('../data/interim/data_with_stations_with_coords.csv')
    # 2. Add binary feature: REGULAR vs. all other DESCs
    df['DESC_CATEGORY'] = df.DESC.where(cond=df.DESC == 'REGULAR', other='OTHER')
    # 3. Add feature to identify a unique turnstile device by the known triplet of C/A, Remote, SCP
    df['TURNSTILE_ID'] = df['C/A'] + " " + df['UNIT'] + " " + df['SCP']
    # 4. Generate time features
    df = feature_generation.calc_features_from_datetime(df)
    # 5. Process odometer-like (cumulative) readings for entry and exit
    df = df.sort_values(['C/A', 'UNIT', 'SCP', 'AUDIT_DATE_TIME']).reset_index(drop=True)
    df = feature_generation.calc_features_from_cumulative_records(df)
    # 6. Save the ready-for-analytics processed file with all extra features
    df.to_csv('../data/processed/data_processed.csv', index=False)
df = pd.read_csv('../data/processed/data_processed.csv',
                 parse_dates=['AUDIT_DATE_TIME'])
③ Let's see what columns and data types we have after preprocessing
df.info()
df.head(3)

We have two groups of questions.
🔬 Data analysis
📊 Visualization
Let's answer these questions using our data.
Notes:
- 86 ST and CANAL ST: note that the same station name can appear for several distinct stations.
- We used the booth -to- remote units -to- station mapping (Remote-Booth-Station.xls file) to answer this question.
- Some Remote Unit/Control Area pairs from the empirical dataset were not present in the mapping file.
df_aux = df.groupby('Station').UNIT.nunique()
fig, axes = plt.subplots(1, 1, figsize=(15, 5), dpi=350)
_ = visuals_prepared.plot_bar_chart_units_per_stations(df_aux, axes)
title = f'Fig.{img_num}. DA1. Which station has the most number of units?'
fig.suptitle(t=title, x=0.04, y=0.96, **main_title_dict_style); img_num += 1
axes.spines['top'].set_visible(True)
fig.tight_layout(rect=[0.0, 0.0, 1.0, 1.0])
plt.show()
display(df.groupby('Station').UNIT.nunique().nlargest(5))
Notes:
- We define row counts as the number of audit quintets (groups of five columns that describe every audit event) that we have after data transformations, preprocessing, and cleaning.
- We split the counts by REGULAR and not-REGULAR audit description code.
df_aux = df.groupby(['DATE', 'DESC_CATEGORY']).row.count().unstack(1)
fig, axes = plt.subplots(2, 1, figsize=(15, 16), dpi=350)
axes = axes.flatten()
df_aux.REGULAR.plot.line(style='.', ax=axes[0], color=['orange'], alpha=0.5, zorder=10, label='regular scatter')
df_aux.REGULAR.rolling(7).mean().plot.line(style='--', ax=axes[0], color=['orange'], label='regular weekly mean',
alpha=0.9, zorder=10)
df_aux.plot.line(style='.', ax=axes[1], alpha=0.5, zorder=10, label='both scatter')
df_aux.rolling(7).mean().plot.line(style='--', ax=axes[1], color=['blue', 'orange'], label='both weekly mean',
alpha=0.9, zorder=10)
for indx in range(len(axes)):
    axes[indx].grid(axis='both', alpha=0.20, c='gray', linestyle='--', zorder=0)
    axes[indx].set(xlabel='Date')
    axis_title = 'Daily row counts for data files in Q1 2013'
    axes[indx].set_title(axis_title + ': Regular audits' if indx % 2 == 0 else axis_title, loc='left')
    axes[indx].spines[['left', 'right']].set_visible(False); axes[indx].spines[['top']].set_visible(True)
    axes[indx].legend().remove()
handles, labels = fig.gca().get_legend_handles_labels()
by_label = dict(zip(labels, handles)) # # reversed(by_label.keys())
legend1 = fig.legend(handles=reversed(by_label.values()), labels=['REGULAR, 7 days MA', 'OTHER, 7 days MA'],
ncol=7, loc='upper left', bbox_to_anchor=(0.72, 0.978), frameon=False)
legend2 = fig.legend(handles=reversed(by_label.values()), labels=['REGULAR, 7 days MA', 'OTHER, 7 days MA'],
ncol=7, loc='upper left', bbox_to_anchor=(0.72, 0.492), frameon=False)
title = f'Fig.{img_num}. VIZ1. Plot the daily row counts for data files in Q1 2013.'
fig.suptitle(t=title, x=0.051, y=0.99, **main_title_dict_style); img_num += 1
fig.tight_layout(rect=[0.0, 0.0, 1.0, 1.0])
See the Two Sigma Data Clinic MTA project for reference [link].
df_aux = df.groupby('DATE')[['ENTRIES_DIFF', 'EXITS_DIFF', 'BUSYNESS']].sum().loc['02-01-13']  # DATE is MM-DD-YY, so February 1, 2013 is '02-01-13'
fig, axes = plt.subplots(1, 1, figsize=(15, 5), dpi=350)
_ = visuals_prepared.plot_one_line_entry_exit_chart(entries=df_aux['ENTRIES_DIFF'],
exits=df_aux['EXITS_DIFF'],
y=1, axis=axes)
title = f'Fig.{img_num}. DA2. What is the total number of entries & \
exits across the subway system for February 1, 2013?'
fig.suptitle(t=title, x=0.015, y=0.97, **main_title_dict_style); img_num += 1
axes.spines[['top']].set_visible(True)
fig.tight_layout(rect=[0.0, 0.0, 1.0, 1.0])
plt.show()
display(df_aux)
The gap between ENTRIES and EXITS is pretty high. However, the vast majority of sources that work with this dataset explain it by either the use of emergency exit doors or systematically mismatched counters. So, let's keep that in mind.
df_aux = df.groupby('DATE')[['ENTRIES_DIFF', 'EXITS_DIFF']].sum()
df_aux.index = pd.to_datetime(df_aux.index)
title = f"Fig.{img_num}. VIZ2. Daily total number of entries & exits across the system for Q1 2013"; img_num += 1
yaxis_label = "Total number"
xaxis_label = "Date"
visuals_prepared.plot_interactive_line_or_bar(df_aux, title, yaxis_label, xaxis_label, lineplot=True)
Notes:
- The busiest station is 34 ST-PENN STA; the busiest turnstile there is R138 R293 00-00-02.
- We define a turnstile as a concatenated string of C/A-UNIT-SCP (likely one unique specific device, i.e., one tripod gate). Some people use just C/A-UNIT (likely a group of "devices", i.e., a group of tripod gates).
- The second subquestion may mean the busiest turnstile in general, or the busiest turnstile at the station from the first part of the question (What station was the busiest on February 1, 2013?). We used the second reading, for the busiest station.
Subquestion 1: What station was the busiest on February 1, 2013?
# DATE is MM-DD-YY, so February 1, 2013 is '02-01-13'
df.groupby(['DATE', 'Station'])[['ENTRIES_DIFF', 'EXITS_DIFF', 'BUSYNESS']].sum().loc['02-01-13'].\
    sort_values(['BUSYNESS'], ascending=False).head(10)
Subquestion 2: What turnstile was the busiest on that date?
# DATE is MM-DD-YY, so February 1, 2013 is '02-01-13'
df[(df.Station == '34 ST-PENN STA') & (df.DATE == '02-01-13')]\
.groupby(['TURNSTILE_ID', 'Station']).BUSYNESS.sum().sort_values(ascending=False)
df['IS_WEEKEND'] = df['AUDIT_DOW'].isin([5, 6])
df_aux = df[(df['Station'] == '34 ST-PENN STA')]\
.groupby(['AUDIT_MONTH', 'DATE', 'IS_WEEKEND']).BUSYNESS.sum() #& (df.DESC == 'REGULAR')
fig, axes = plt.subplots(nrows=1, ncols=1, figsize=(16, 9), dpi=350)
_ = visuals_prepared.plot_gradientplot_intervals(df_aux, fig, axes)
title = f'Fig.{img_num}. VIZ3. Plot the mean and standard deviation of the daily total number of entries \
& exits for each month in Q1 2013 for station 34 ST-PENN STA.'
fig.suptitle(t=title, x=0.053, y=0.98, **main_title_dict_style); img_num += 1
fig.tight_layout(rect=[0.0, 0.0, 1.0, 1.0])
Answer:
Notes / remarks:
See matplotlib's documentation, [here]:
      Q1-1.5IQR    Q1    median    Q3    Q3+1.5IQR
                    |-----:-----|
    o     |--------|     :     |--------|    o  o
                    |-----:-----|
  flier            <----------->            fliers
                        IQR
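The 25/50/75 percentiles that the boxes encode can also be computed directly with pandas quantile (toy series, linear interpolation by default):

```python
import pandas as pd

daily_busyness = pd.Series([10, 20, 30, 40, 50])

# 25th, 50th (median) and 75th percentiles in one call
print(daily_busyness.quantile([0.25, 0.5, 0.75]).tolist())  # [20.0, 30.0, 40.0]
```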
df_to_plot = df_aux.reset_index()[['AUDIT_MONTH', 'BUSYNESS', 'IS_WEEKEND']]
#df_to_plot.AUDIT_MONTH = df_to_plot.AUDIT_MONTH.astype('object')
dtf = [
df_to_plot[(df_to_plot['AUDIT_MONTH'] == 1) & (df_to_plot.IS_WEEKEND)].BUSYNESS,
df_to_plot[(df_to_plot['AUDIT_MONTH'] == 1) & (~df_to_plot.IS_WEEKEND)].BUSYNESS,
df_to_plot[(df_to_plot['AUDIT_MONTH'] == 2) & (df_to_plot.IS_WEEKEND)].BUSYNESS,
df_to_plot[(df_to_plot['AUDIT_MONTH'] == 2) & (~df_to_plot.IS_WEEKEND)].BUSYNESS,
df_to_plot[(df_to_plot['AUDIT_MONTH'] == 3) & (df_to_plot.IS_WEEKEND)].BUSYNESS,
df_to_plot[(df_to_plot['AUDIT_MONTH'] == 3) & (~df_to_plot.IS_WEEKEND)].BUSYNESS,
df_to_plot[(df_to_plot['AUDIT_MONTH'] == 1)].BUSYNESS,
df_to_plot[(df_to_plot['AUDIT_MONTH'] == 2)].BUSYNESS,
df_to_plot[(df_to_plot['AUDIT_MONTH'] == 3)].BUSYNESS
]
fig, axes = plt.subplots(1, 1, figsize=(15, 10), dpi=350)
visuals_prepared.plot_boxplot_jitter_mix(df_jitter=df_to_plot, df_boxes=dtf, fig=fig, axis=axes)
title = f'Fig.{img_num}. VIZ4. Plot 25/50/75 percentile of the daily total number of entries \
& exits for each month in Q1 2013 for station 34 ST-PENN STA.'
fig.suptitle(t=title, x=0.07, y=1.01, **main_title_dict_style); img_num += 1
fig.tight_layout(rect=[0.0, 0.0, 1.0, 1.0])
del dtf
fig, ax = plt.subplots(figsize=(10, 8), dpi=350)
ort = "h"; pal = sns.color_palette(n_colors=2)  # orientation and palette used below
ax = pt.half_violinplot(x='BUSYNESS', y='AUDIT_MONTH',
data=df_aux.reset_index()[['AUDIT_MONTH', 'BUSYNESS', 'IS_WEEKEND']],
orient='h', bw = .1, alpha=0.8)
ax = sns.stripplot(x='BUSYNESS', y='AUDIT_MONTH',
data=df_aux.reset_index()[['AUDIT_MONTH', 'BUSYNESS', 'IS_WEEKEND']],
edgecolor="white", palette=pal,
size=3, jitter=1,
zorder=0, orient=ort,
hue='IS_WEEKEND')
plt.legend('')
fig, ax = plt.subplots(figsize=(16, 8), dpi=350)
pt.RainCloud(y='BUSYNESS', x='AUDIT_MONTH',
data=df_aux.reset_index()[['AUDIT_MONTH', 'BUSYNESS', 'IS_WEEKEND']],
hue='IS_WEEKEND',
dodge = True, jitter=0.1, move=0.05, point_size=3, width_box=0.15,
orient='h', alpha=0.9, point_c='gray',
ax=ax, zorder=10)
ax.grid(axis='x', linestyle='--', alpha=0.2, color='gray', zorder=10)
ax.spines[['left']].set_visible(False)
ax.legend().remove()
Notes:
- We defined growth/decline as the median weekly change, %, for the respective station.
- AQUEDUCT TRACK was closed from 2011 to 2013 (closed and rebuilt to provide better access to the Resorts World Casino).
- BEACH 90 ST was not operating normally around that time.
df_aux = df.groupby(['Station', 'AUDIT_WEEK']).BUSYNESS.sum() # mean?
stations_reindex = df.groupby(['Station']).AUDIT_WEEK.nunique()\
[df.groupby(['Station']).AUDIT_WEEK.nunique() < 13]
# because I was too lazy to work with proper datetime resampling,
# & pd.reindex does not work well with a MultiIndex - see the workaround below
for station in stations_reindex.index:
    tmp_ = df_aux.loc[station].reindex(range(1, 14)).to_frame()
    tmp_['Station'] = station
    tmp_.set_index('Station', append=True, inplace=True)
    tmp_ = tmp_.reorder_levels(['Station', 'AUDIT_WEEK'])
    df_aux = df_aux.drop(level='Station', index=station)
    df_aux = pd.concat([df_aux, tmp_.squeeze(axis=1)])
del tmp_, stations_reindex
# may use different statistics - it's up to expert
stations_growth = df_aux.groupby(['Station']).pct_change(1).groupby(['Station']).median().nlargest(10)
stations_decline = df_aux.groupby(['Station']).pct_change(1).groupby(['Station']).median().nsmallest(10)
display(stations_growth); display(stations_decline)
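The week-over-week change used above comes from pct_change applied within each station group; a toy sketch of its semantics:

```python
import math
import pandas as pd

idx = pd.MultiIndex.from_product([['X'], [1, 2, 3]],
                                 names=['Station', 'AUDIT_WEEK'])
s = pd.Series([100, 110, 121], index=idx, name='BUSYNESS')

# First value in each group has no previous week, hence NaN
changes = s.groupby('Station').pct_change(1)
print([round(v, 2) if not math.isnan(v) else None for v in changes])
# [None, 0.1, 0.1]
```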
Subquestion 1: What stations have seen the most usage growth in 2013?
%%time
fig, axes = plt.subplots(4, 4, figsize=(16, 12), sharey=False, sharex=True, dpi=350)
axes = axes.flatten()
_ = visuals_prepared.plot_panel_bars_station_change(df_aux, stations_growth, fig, axes,
plot_MA=True, plot_pct_change=False)
[axis.remove() for axis in axes[-6:]]
#handles, labels = plt.gca().get_legend_handles_labels()
#fig.legend(handles[:0:-1], labels[:0:-1], ncol=2, frameon=False, loc='upper right', bbox_to_anchor=[0.96, 1.025])
fig_sub_caption = '''Top-10 Stations by busy-ness change, week to previous week. \
Colored boxes are median* changes. Bars are weekly totals. Lines are MA(1, 2, 3)*'''
fig.text(x=0.05, y=0.99, s=fig_sub_caption, fontsize=12)
title = f"""Fig.{img_num}. DA4. What stations have seen the most usage growth* in 2013?"""; img_num += 1
fig.suptitle(title, **main_title_dict_style, x=0.055, y=1.025)
fig.tight_layout(rect=[0.0, 0.0, 1.0, 1.0])
Subquestion 2: What stations have seen the most usage decline in 2013?
%%time
fig, axes = plt.subplots(4, 4, figsize=(16, 12), sharey=False, sharex=True, dpi=350)
axes = axes.flatten()
_ = visuals_prepared.plot_panel_bars_station_change(df_aux, stations_decline, fig, axes,
plot_MA=True, plot_pct_change=False)
[axis.remove() for axis in axes[-6:]]
fig_sub_caption = '''Top-10 Stations by busy-ness change, week to previous week. \
Colored boxes are median* changes. Bars are weekly totals. Lines are MA(1, 2, 3)*'''
fig.text(x=0.05, y=0.99, s=fig_sub_caption, fontsize=12)
title = f"""Fig.{img_num}. DA4. What stations have seen the most usage decline* in 2013?"""; img_num += 1
fig.suptitle(title, **main_title_dict_style, x=0.055, y=1.025)
fig.tight_layout(rect=[0.0, 0.0, 1.0, 1.0])
Notes:
- The question has two parts: which dates are the least busy, and can we identify days on which stations were not operating at full capacity or were closed entirely.
- To define the least busy, we used a quantile approach: dates with less than the 25th percentile of busy-ness.
- To define not operating at full capacity or closed entirely, we used the percentile approach: less than the 5th percentile of busy-ness (e.g., AQUEDUCT TRACK, BEACH 90 ST).
Subquestion 1: What dates are the least busy?
percentile = 0.25
df_aux = df.groupby(['DATE', 'AUDIT_DOW']).BUSYNESS.sum()
df_aux[df_aux <= df_aux.quantile(percentile)]
Subquestion 2: Could you identify days on which stations were not operating at full capacity or closed entirely?
percentile = 0.05
df_stationwise_aggregation = df.groupby(['Station', 'Line Name', 'Division', 'DATE']).BUSYNESS.sum().reset_index(level=[1])
df_stationwise_quantile = df.groupby(['Station', 'Line Name', 'Division', 'DATE']).BUSYNESS.sum().groupby(['Station']).quantile(percentile)
df_aux = df_stationwise_aggregation.join(df_stationwise_quantile.to_frame(), rsuffix='_target_percentile')
df_aux = df_aux.reset_index()
df_aux['DATE'] = pd.to_datetime(df_aux['DATE'])
del df_stationwise_aggregation, df_stationwise_quantile
df_aux[df_aux.BUSYNESS <= df_aux.BUSYNESS_target_percentile].DATE.value_counts().nlargest(5)
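The same join-against-own-percentile logic can be written more compactly with groupby transform; a toy sketch (column names follow the dataset, the threshold is the same 5th percentile):

```python
import pandas as pd

df_toy = pd.DataFrame({'Station': ['X'] * 4 + ['Y'] * 4,
                       'BUSYNESS': [0, 90, 100, 110, 5, 50, 55, 60]})

# Per-station 5th percentile, broadcast back onto every row
thresh = df_toy.groupby('Station').BUSYNESS.transform(lambda s: s.quantile(0.05))
low_days = df_toy[df_toy.BUSYNESS <= thresh]
print(low_days.BUSYNESS.tolist())  # [0, 5]
```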
Notes / remarks:
To define not operating at full capacity or closed entirely, we used the percentile approach: less than 5% of busy-ness for that station. To define closed entirely, we used a heuristic: busy-ness equal to 0 (zero).
Sometimes a "calendar heatmap" (see: link or link) is more interesting and easier to read than bar charts for this type of question.
df_plot = df_aux[df_aux.BUSYNESS <= df_aux.BUSYNESS_target_percentile]
df_plot = df_plot.groupby('DATE').Station.nunique().rename('Number of stations')
title = f"""Fig.{img_num}. VIZ5. Daily number of closed* stations and number of stations \
that were not operating at full capacity* in Q1 2013"""; img_num += 1
_ = visuals_prepared.plot_heatmap_calendar(df_plot, title, main_title_dict_style);
del df_plot
df_plot = df_aux[df_aux.BUSYNESS == 0]
df_plot = df_plot.groupby('DATE').Station.nunique().rename('Number of stations')
title = f'Fig.{img_num}. VIZ5. Daily number of closed* stations in Q1 2013'; img_num += 1
_ = visuals_prepared.plot_heatmap_calendar(df_plot, title, main_title_dict_style);
del df_plot
# df_aux[df_aux.BUSYNESS == 0].Station.value_counts()
To find the busiest hour we used circular lags of $\{0, -1, -2, -3, +1\}$ hours (to handle audit occurrences, because we didn't resample the dataset) and took the mean over the lags (to "smooth" the series).
Bar plot view
days_of_week = {"Weekdays": [0, 1, 2, 3, 4], "Weekends": [5, 6]}
lags = [-1, -2, -3, 1]
fig, axes = plt.subplots(1, 2, figsize=(14, 5), sharey=True, dpi=350)
axes = axes.flatten()
# Plots hourly `busyness metric` (of subway usage) in ordinal barplot.
for indx, (panel_title, dows) in enumerate(days_of_week.items()):
    # Hourly mean busy-ness; reindex so that all 24 hours are present
    df_aux = df[(df['Station'] == 'CANAL ST')
                & (df.AUDIT_DOW.isin(dows))  # & (df.DESC == "REGULAR")
                ].groupby('AUDIT_HOUR').BUSYNESS.mean().reindex(range(24))
    df_aux = df_aux.to_frame()
    for lag in lags:
        df_aux[f'lag_{lag}'] = np.roll(df_aux.BUSYNESS, lag)
    df_aux.mean(axis=1).plot.bar(ax=axes[indx], color='#009aa6', zorder=10, label='Hourly mean bar chart')
    # Beautify
    axes[indx].margins(y=0.1, tight=False); axes[indx].set_ylim(ymin=-3)
    axes[indx].grid(axis='y', alpha=0.2, c='gray', linestyle='--', zorder=0)
    ticks = ["{}:00".format(x) for x in range(24)]
    axes[indx].set_xticklabels(ticks)
    axes[indx].set_title(panel_title, loc='left', pad=-0.05)
    if indx % 2 == 0: axes[indx].spines[['top', 'left']].set_visible(False)
    else: axes[indx].spines[['top', 'right']].set_visible(False)
title = f'Fig.{img_num}. DA6. What hour is the busiest* for station CANAL ST in Q1 2013?'
fig.suptitle(t=title, x=0.016, y=0.98, **main_title_dict_style); img_num += 1
fig.tight_layout(rect=[0.0, 0.0, 1.0, 1.0])
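The circular lagging above relies on np.roll, which wraps values around the end of the array (so hour 23 feeds into hour 0 and vice versa):

```python
import numpy as np

hours = np.arange(24)
# Positive shift moves the last element to the front; negative shift does the opposite
print(np.roll(hours, 1)[:3].tolist())   # [23, 0, 1]
print(np.roll(hours, -1)[:3].tolist())  # [1, 2, 3]
```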
Polar bar plot view
days_of_week = {"Weekdays": [0, 1, 2, 3, 4], "Weekends": [5, 6]}
lags = [-1, -2, -3, 1]
fig, axes = plt.subplots(1, 2, figsize=(16, 12), sharey=True,
subplot_kw={'projection': "polar"}, dpi=350)
for indx, (panel_title, dows) in enumerate(days_of_week.items()):
    # Hourly mean busy-ness; reindex so that all 24 hours are present
    df_aux = df[(df['Station'] == 'CANAL ST') & (df.AUDIT_DOW.isin(dows))].\
        groupby('AUDIT_HOUR').BUSYNESS.mean().reindex(range(24))
    df_aux = df_aux.to_frame()
    for lag in lags:
        df_aux[f'lag_{lag}'] = np.roll(df_aux.BUSYNESS, lag)
    _ = visuals_prepared.plot_clock_metric_hourly(df_aux.mean(axis=1), "#009aa6", panel_title, axes[indx])
    axes[indx].yaxis.set_tick_params(labelbottom=True)
    axes[indx].spines['polar'].set_visible(False)
title = f'Fig.{img_num}. DA6. What hour is the busiest* for station CANAL ST in Q1 2013?'
fig.suptitle(t=title, x=0.01, y=0.87, **main_title_dict_style); img_num += 1
fig.tight_layout(rect=[0.0, 0.0, 1.0, 1.0])
